Add best-effort cleanup to EksCreateNodegroupOperator on post-create failure #61145

Merged
shahar1 merged 1 commit into apache:main from SameerMesiah97:61142-EKSCreateNodeGroupOperator-Cleanup on Feb 10, 2026

SameerMesiah97 commented Jan 27, 2026

Description

Added best-effort cleanup for EKS managed nodegroups so that a nodegroup is deleted when a failure occurs after it has been successfully created. Cleanup behavior is guarded by a flag and is enabled by default.

Previously, nodegroup creation could succeed via create_nodegroup, but the operator could then fail during post-creation steps (for example, when waiting for nodegroup readiness with wait_for_completion=True and missing eks:DescribeNodegroup permissions). In these cases, the Airflow task failed while the EKS managed nodegroup continued provisioning or running in AWS.

Cleanup logic has now been added to the internal _create_compute helper. If an exception is raised after nodegroup creation during the wait phase, the operator attempts a best-effort deletion of the nodegroup. Cleanup failures are logged but do not mask or replace the original exception.
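The control flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `create_and_wait`, the `create`/`wait`/`delete` callables, and `FakeWaitError` are hypothetical stand-ins for the real helper, the boto3 calls, and the AWS exception types.

```python
import logging

log = logging.getLogger(__name__)


class FakeWaitError(Exception):
    """Hypothetical stand-in for the error raised during the wait phase."""


def create_and_wait(create, wait, delete, delete_on_failure=True):
    """Create a resource, wait for it, and attempt cleanup if the wait fails."""
    # If creation itself raises, nothing exists yet, so no cleanup is attempted.
    create()
    try:
        wait()
    except FakeWaitError:
        if delete_on_failure:
            try:
                delete()
            except Exception:
                # Best-effort: log the cleanup failure and fall through.
                log.exception("Cleanup failed; the resource may need manual deletion")
        # Bare raise re-raises the original wait failure, not the cleanup error.
        raise
```

The bare `raise` sits outside the inner `try`, so the caller always sees the original wait failure, whether or not the deletion succeeded.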

Cleanup is triggered only for EKS nodegroup failures (including WaiterError) that occur after the nodegroup has been created. Deletion is therefore attempted only when a nodegroup actually exists, and non-AWS exceptions are not intercepted.
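As an illustration of this scoping rule, the decision can be written as a small predicate. The names below are hypothetical; in real code the stand-in classes would presumably be botocore's `ClientError` and `WaiterError`, which are not imported here so that the sketch runs without AWS dependencies.

```python
class StandInClientError(Exception):
    """Hypothetical stand-in for an AWS client error."""


class StandInWaiterError(Exception):
    """Hypothetical stand-in for an AWS waiter error."""


# Only these exception types are treated as AWS-side failures.
AWS_ERRORS = (StandInClientError, StandInWaiterError)


def should_cleanup(exc, nodegroup_created, cleanup_enabled=True):
    """Return True only for AWS errors raised after a successful create."""
    return cleanup_enabled and nodegroup_created and isinstance(exc, AWS_ERRORS)
```

Under these assumptions, a `ValueError` raised during the wait phase, or any error raised before creation succeeded, would not trigger a delete call.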

Rationale

EKS managed nodegroups are external resources whose lifecycle extends beyond the execution of the Airflow task. If nodegroup creation succeeds but subsequent steps fail, Airflow may lose the ability to observe or manage the resource, potentially leaving nodegroups running unexpectedly.

Failures after nodegroup creation can occur for multiple reasons, including partial IAM permissions (for example, allowing eks:CreateNodegroup but denying eks:DescribeNodegroup, which is required by the waiter). In such cases, the nodegroup may continue provisioning even though the Airflow task has failed.
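For reference, the partial-permission scenario could be reproduced with a policy shaped like the one below. This is an assumed minimal example, expressed as a Python dict so that it can be serialized to JSON. The wildcard `Resource` is illustrative only, and a real role would likely need further permissions (for example, to pass the node role) that are omitted here.

```python
import json

# Illustrative policy: creation is allowed, but the describe call is denied.
partial_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["eks:CreateNodegroup"],
            "Resource": "*",  # assumption: not scoped to a specific cluster
        },
        {
            "Effect": "Deny",
            "Action": ["eks:DescribeNodegroup"],
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(partial_policy, indent=2)
```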

This change applies only to nodegroup creation and does not affect cluster creation, deletion, or Fargate profiles. Cleanup is scoped narrowly to nodegroups created during the current execution and is only attempted when nodegroup creation has already completed successfully. This prevents interference with unrelated resources while avoiding orphaned EKS-managed infrastructure on post-create failures.

Restricting cleanup to post-creation EKS nodegroup failures prevents unintended deletion in unrelated failure paths while still addressing orphaned nodegroups created during execution.

Notes

This series of changes intentionally avoids introducing a shared abstraction for AWS operator cleanup logic. Resource creation, ownership tracking, and cleanup semantics vary significantly across AWS services, and a generic solution would add complexity without clear benefit. Cleanup is therefore implemented locally where behavior and failure modes are well understood.

Tests

  • Added a unit test verifying that nodegroup deletion is attempted when a failure occurs during the wait phase after successful creation.
  • Added a unit test ensuring that failures during cleanup do not mask or override the original exception.
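The two tests have roughly the following shape. This sketch exercises a hypothetical stand-in function rather than the real operator: `create_nodegroup_with_cleanup`, `WaitFailed`, and the client method `wait_until_active` are invented for illustration and are not part of the provider or of boto3.

```python
import unittest
from unittest import mock


class WaitFailed(Exception):
    """Hypothetical stand-in for a waiter failure."""


def create_nodegroup_with_cleanup(client, name):
    """Stand-in for the behavior under test (not the real operator code)."""
    client.create_nodegroup(nodegroupName=name)
    try:
        client.wait_until_active(name)  # hypothetical wait call
    except WaitFailed:
        try:
            client.delete_nodegroup(nodegroupName=name)
        except Exception:
            pass  # best-effort; the real code logs the cleanup failure
        raise


class TestCleanup(unittest.TestCase):
    def test_delete_attempted_on_wait_failure(self):
        client = mock.MagicMock()
        client.wait_until_active.side_effect = WaitFailed("denied")
        with self.assertRaises(WaitFailed):
            create_nodegroup_with_cleanup(client, "ng")
        client.delete_nodegroup.assert_called_once_with(nodegroupName="ng")

    def test_cleanup_failure_does_not_mask_original(self):
        client = mock.MagicMock()
        client.wait_until_active.side_effect = WaitFailed("denied")
        client.delete_nodegroup.side_effect = RuntimeError("delete failed")
        # The original WaitFailed must surface, not the RuntimeError.
        with self.assertRaises(WaitFailed):
            create_nodegroup_with_cleanup(client, "ng")
```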

Documentation

The docstring for EksCreateNodegroupOperator has been updated with a brief description of the new flag delete_nodegroup_on_failure.

Backwards Compatibility

A new flag called delete_nodegroup_on_failure has been added to EksCreateNodegroupOperator with a default setting of True. Best-effort cleanup will now be attempted if a post-creation failure (including WaiterError) occurs after the nodegroup has been successfully created.
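Users who prefer the previous behavior can presumably opt out by setting the flag to False. The keyword arguments below are a sketch with placeholder values. The cluster name, subnet IDs, and role ARN are invented, and the dict is meant to be unpacked into `EksCreateNodegroupOperator(**nodegroup_kwargs)` inside a DAG, which is not imported here.

```python
# Placeholder values for illustration only; none refer to real resources.
nodegroup_kwargs = {
    "task_id": "create_nodegroup",
    "cluster_name": "example-cluster",
    "nodegroup_name": "example-nodegroup",
    "nodegroup_subnets": ["subnet-00000000", "subnet-11111111"],
    "nodegroup_role_arn": "arn:aws:iam::123456789012:role/example-node-role",
    "wait_for_completion": True,
    # New flag from this PR: defaults to True; False disables cleanup.
    "delete_nodegroup_on_failure": False,
}
```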

Closes: #61142

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jan 27, 2026
@SameerMesiah97 SameerMesiah97 force-pushed the 61142-EKSCreateNodeGroupOperator-Cleanup branch 2 times, most recently from 345ab8b to 2ff71f0 January 28, 2026 19:24
occur after successful creation (e.g. waiter failures due to missing
DescribeNodegroup permissions).

This change adds best-effort cleanup when post-create steps fail by attempting
to delete the nodegroup that was successfully created. Cleanup errors are
logged but do not mask the original exception. Cleanup is enabled by default.

Tests cover successful cleanup on waiter failure and ensure cleanup failures
do not override the original error.
@SameerMesiah97 SameerMesiah97 force-pushed the 61142-EKSCreateNodeGroupOperator-Cleanup branch from 1ed447c to 26b4853 January 28, 2026 20:19
@shahar1 shahar1 changed the title EksCreateNodegroupOperator could leave nodegroups running after failure Add best-effort cleanup to EksCreateNodegroupOperator on post-create failure Feb 10, 2026
@shahar1 shahar1 merged commit 6ca21bf into apache:main Feb 10, 2026
88 of 89 checks passed
Alok-kumar-priyadarshi pushed a commit to Alok-kumar-priyadarshi/airflow that referenced this pull request Feb 11, 2026
Ratasa143 pushed a commit to Ratasa143/airflow that referenced this pull request Feb 15, 2026
choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026
Labels

area:providers provider:amazon AWS/Amazon - related issues

Development

Successfully merging this pull request may close these issues.

EksCreateNodegroupOperator leaks EKS nodegroup on failure with partial IAM permissions

4 participants